Hybrid focused crawling on the Surface and the Dark Web

نویسندگان

  • Christos Iliou
  • George Kalpakis
  • Theodora Tsikrika
  • Stefanos Vrochidis
  • Yiannis Kompatsiaris
چکیده

Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating through the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic of interest. This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly navigate through the Surface Web and several darknets present in the Dark Web (i.e., Tor, I2P, and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the destination network type and the strength of the local evidence present in the vicinity of a hyperlink. It investigates 11 hyperlink selection methods, among which a novel strategy proposed based on the dynamic linear combination of a link-based and a parent Web page classifier. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed focused crawler both for the Surface and the Dark Web.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not an simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...

متن کامل

From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation

Focused crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It typically starts from a useror communityspecific tree of topics along with a few training documents for each tree node, and then crawls the Web with focus on these topics of interest. This process can efficiently build a theme-specific, hierarchical directory whose nodes are populate...

متن کامل

A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections

The Web, containing a large amount of useful information and resources, is expanding rapidly. Collecting domain-specific documents/information from the Web is one of the most important methods to build digital libraries for the scientific community. Focused Crawlers can selectively retrieve Web documents relevant to a specific domain to build collections for domain-specific search engines or di...

متن کامل

Focused Crawling: A Means to Acquire Biological Data from the Web

Experience paper. World Wide Web contains billions of publicly available documents (pages) and it grows and changes rapidly. Web search engines, such as Google and Altavista, provide access to indexable Web documents. An important part of a search engine is a Web crawler whose function is to collect Web pages for the search engine. Due to the Web’s immense size and dynamic nature no crawler is ...

متن کامل

Ontology Based Approach for Services Information Discovery using Hybrid Self Adaptive Semantic Focused Crawler

Focused crawling is aimed at specifically searching out pages that are relevant to a predefined set of topics. Since ontology is an all around framed information representation, ontology based focused crawling methodologies have come into exploration. Crawling is one of the essential systems for building information stockpiles. The reason for semantic focused crawler is naturally finding, comme...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • EURASIP J. Information Security

دوره 2017  شماره 

صفحات  -

تاریخ انتشار 2017